In this project, we explore training an ML model to predict whether a home was built before 1980 based on the other columns. Real-world datasets can be dirty or have missing data, and pre-1980 homes may contain asbestos, which makes the build year important to identify. The goal is to provide a trained model that predicts, with an accuracy of at least 90%, whether a home was built pre- or post-1980 based on the other data.
read and clean data
```python
df = pd.read_csv("dwellings_denver.csv")

# clean up dirty data
df['condition'].replace("AVG", 'average', inplace=True)
df['floorlvl'].replace(np.nan, 0, inplace=True)
df['gartype'].replace(np.nan, "None", inplace=True)
df['pre-1980'] = df['yrbuilt'] < 1980
```
Question 1| Relationship Charts
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
evaluate arcstyle
```python
# create dataframe for comparing architecture style to year built
nbhd_yb = df[['arcstyle', 'pre-1980']]
nbhd_yb = nbhd_yb.sort_values('arcstyle')
nbhd_yb['arcstyle'] = nbhd_yb['arcstyle'].astype('string')

# show chart
chart1 = px.histogram(nbhd_yb, x='arcstyle', color='pre-1980',
                      title='Pre/Post 1980 Homes per Architecture Style',
                      labels={'arcstyle': 'Architecture Style'})
chart1.show()
```
The chart above shows a reasonable correlation between architecture style and year built. Blue bars represent the total count of homes in each style built before 1980; the red bar stacked above the blue represents homes built in 1980 or later. Clearly, most pre-1980 homes were one-story. However, other categories (end unit, middle unit) are fairly evenly split, making this feature somewhat useful, but not a great predictor of year built on its own.
evaluate nbhd
```python
# create dataframe for comparing neighborhood to year built
nbhd_yb = df[['nbhd', 'pre-1980']]
nbhd_yb = nbhd_yb.sort_values('nbhd')
nbhd_yb['nbhd'] = nbhd_yb['nbhd'].astype('string')

# show chart
chart2 = px.histogram(nbhd_yb, x='nbhd', color='pre-1980', nbins=800,
                      range_y=[0, 700],
                      title='Pre/Post 1980 Homes per Neighborhood',
                      labels={'nbhd': 'Neighborhood Code'})
chart2.show()
```
The chart above shows a strong correlation between neighborhood and year built. Blue bars represent the total count of homes in each neighborhood built before 1980; the red bar stacked above the blue represents homes built in 1980 or later. Not surprisingly, bars tend to be mostly red or mostly blue, as the homes in a neighborhood tend to be built during the same time period. This appears to be a very good predictor.
Task 2| Model Building
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
```mermaid
%% Flowchart: Load to Score
flowchart LR
    A[Load Data] --> B(Clean Data)
    B --> C(Encode Categorical Data)
    C --> D(Classify/Select Columns)
    D --> E(Split Data for Train/Test)
    E --> F(Training Data)
    F --> G(Train Model)
    E --> H(Testing Data)
    G --> I(Test Model)
    H --> I(Test Model)
    I --> J(Score Model)
```
read and format data
```python
# define columns we will use for training and testing, x and y
columns = ['nbhd', 'quality', 'stories', 'gartype', 'numbaths', 'arcstyle']
columns_to_encode = ['quality', 'gartype', 'arcstyle']
x = df[columns]
y = df['pre-1980']

# encode columns
x_encoded = x
for c in columns_to_encode:
    x_encoded = encode_column(x_encoded, c)

# create the model
model = DecisionTreeClassifier()
model.fit(x_encoded, y)

# identify important features
selected_model = SelectFromModel(model, prefit=True)
x_encoded_selected = selected_model.transform(x_encoded)

# create model for the selected set
model_selected_by_model = DecisionTreeClassifier()
model_selected_by_model.fit(x_encoded_selected, y)

# create empty lists for returned accuracy, precision and feature pct
# 6 columns
results_accuracy_6columns = []
results_precision_6columns = []
results_feature_pct_6columns = []
# selected by model
results_accuracy_selected_by_model = []
results_precision_selected_by_model = []
results_feature_pct_selected_by_model = []

# run the test n times, store the data against the selected model and encoded model
result_count = 25
row_count = int(math.sqrt(result_count))
model_list = [
    [model,
     [results_accuracy_6columns,
      results_precision_6columns,
      results_feature_pct_6columns],
     x_encoded],
    [model_selected_by_model,
     [results_accuracy_selected_by_model,
      results_precision_selected_by_model,
      results_feature_pct_selected_by_model],
     x_encoded_selected],
]

while len(results_accuracy_6columns) < result_count:
    for m, datasets, column_list in model_list:
        # split the data
        x_train, x_test, y_train, y_test = train_test_split(column_list, y)
        # run the fit and score
        accuracy_result, precision_result, feature_result = train_test_model(m, x_train, x_test, y_train, y_test)
        # append results
        datasets[0].append(accuracy_result)
        datasets[1].append(precision_result)
        datasets[2].append(feature_result)

# test accuracy
x_train, x_test, y_train, y_test = train_test_split(x_encoded, y)
```
Based on analysis of individual column score results (I ran evaluations against each column's scoring accuracy and retained every column that scored greater than 10%), the initial columns selected were `nbhd`, `quality`, `stories`, `gartype`, `numbaths`, and `arcstyle`.
Data cleaning was applied to floorlvl and gartype to deal with NaNs. Because the dataset contains categorical data, I then ran the dataframe through an encoder to translate those columns to numeric values, which increased the column count from 6 to 313.
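To illustrate why one-hot encoding multiplies the column count, here is a minimal sketch using pandas `get_dummies` as a stand-in for the OneHotEncoder workflow; the frame and its values are hypothetical, not taken from the housing data:

```python
import pandas as pd

# tiny stand-in frame: 2 categorical columns with hypothetical values
homes = pd.DataFrame({
    "quality": ["A", "B", "C", "A"],
    "gartype": ["ATT", "DET", "ATT", "None"],
})

# one-hot encoding creates one new column per unique value:
# quality_A, quality_B, quality_C, gartype_ATT, gartype_DET, gartype_None
encoded = pd.get_dummies(homes, columns=["quality", "gartype"])
print(encoded.shape[1])  # 2 original columns become 6
```

With hundreds of unique neighborhood codes and garage types, the same mechanism explains the jump from 6 columns to 313.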
However, I was unhappy with OneHotEncoder creating new columns and renaming existing ones, so I built my own encoder function that identifies the unique values in a column and translates them to numeric codes.
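A minimal sketch of what such an encoder might look like; the actual `encode_column` implementation is not shown in this report, and this version simply maps each unique value, in order of appearance, to an integer code while keeping the original column name:

```python
import pandas as pd

def encode_column(df, column):
    """Replace each unique value in `column` with an integer code,
    preserving the original column name and count."""
    df = df.copy()
    codes = {value: code for code, value in enumerate(df[column].unique())}
    df[column] = df[column].map(codes)
    return df

# hypothetical garage-type values
homes = pd.DataFrame({"gartype": ["ATT", "DET", "None", "ATT"]})
encoded = encode_column(homes, "gartype")
print(encoded["gartype"].tolist())  # [0, 1, 2, 0]
```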
The data set was then run through the `feature_selection.SelectFromModel` algorithm, which reduced the columns from 6 down to just 2.
After trying the linear, random forest, and Gaussian Naive Bayes regressors, I realized that a better solution was a classifier instead. The linear classifier provided about 70% accuracy, GaussianNB reached 80% but was quite slow, and random forest was also slower. I ended up settling on the Decision Tree Classifier.
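The comparison process can be sketched as below. This harness uses a synthetic dataset and default hyperparameters, not the actual housing data or the author's tuning, so the printed scores are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the encoded housing data
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

candidates = {
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, clf in candidates.items():
    # 5-fold cross-validated mean accuracy for each candidate
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```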
The data was then split into training and test segments using the `train_test_split` method.
Task 3| Model Justification
Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.
Feature Importance Data
justify model
```python
# create a dataset for features
features = pd.DataFrame(results_feature_pct_6columns)
features.loc['mean'] = features.mean()
features.columns = list(x_encoded.columns)

# display feature data
features.style
```
|      | nbhd     | quality  | stories  | gartype  | numbaths | arcstyle |
|------|----------|----------|----------|----------|----------|----------|
| 0    | 0.360224 | 0.126282 | 0.017360 | 0.129387 | 0.035075 | 0.331673 |
| 1    | 0.519572 | 0.139913 | 0.013745 | 0.052188 | 0.039188 | 0.235394 |
| 2    | 0.516576 | 0.143723 | 0.013133 | 0.057958 | 0.038751 | 0.229858 |
| 3    | 0.358933 | 0.126533 | 0.017651 | 0.128062 | 0.038458 | 0.330362 |
| 4    | 0.360980 | 0.127580 | 0.018172 | 0.132058 | 0.035490 | 0.325718 |
| 5    | 0.340862 | 0.135930 | 0.015360 | 0.133062 | 0.040270 | 0.334515 |
| 6    | 0.531246 | 0.139200 | 0.016625 | 0.052877 | 0.034176 | 0.225875 |
| 7    | 0.356737 | 0.124377 | 0.014133 | 0.131267 | 0.038277 | 0.335208 |
| 8    | 0.362125 | 0.122403 | 0.016248 | 0.132522 | 0.034357 | 0.332345 |
| 9    | 0.356891 | 0.131319 | 0.015645 | 0.129914 | 0.031498 | 0.334732 |
| 10   | 0.355500 | 0.128542 | 0.013913 | 0.130186 | 0.038235 | 0.333624 |
| 11   | 0.358372 | 0.130637 | 0.013842 | 0.128801 | 0.035390 | 0.332959 |
| 12   | 0.357029 | 0.126803 | 0.016136 | 0.130869 | 0.036805 | 0.332358 |
| 13   | 0.361880 | 0.130271 | 0.012746 | 0.127244 | 0.038251 | 0.329607 |
| 14   | 0.524180 | 0.143386 | 0.013325 | 0.051024 | 0.033886 | 0.234200 |
| 15   | 0.520211 | 0.146053 | 0.012482 | 0.052595 | 0.037393 | 0.231266 |
| 16   | 0.529128 | 0.148052 | 0.013158 | 0.046581 | 0.033783 | 0.229298 |
| 17   | 0.357628 | 0.124828 | 0.015647 | 0.132065 | 0.032444 | 0.337388 |
| 18   | 0.526618 | 0.140767 | 0.012304 | 0.052298 | 0.039346 | 0.228666 |
| 19   | 0.523415 | 0.142382 | 0.012330 | 0.049365 | 0.035649 | 0.236859 |
| 20   | 0.522306 | 0.216105 | 0.017202 | 0.052690 | 0.035298 | 0.156398 |
| 21   | 0.518333 | 0.145332 | 0.013553 | 0.054424 | 0.038364 | 0.229993 |
| 22   | 0.353207 | 0.134220 | 0.017592 | 0.129613 | 0.034953 | 0.330414 |
| 23   | 0.360086 | 0.131056 | 0.014249 | 0.128869 | 0.038222 | 0.327520 |
| 24   | 0.523781 | 0.216980 | 0.013594 | 0.049100 | 0.036952 | 0.159593 |
| mean | 0.430233 | 0.140907 | 0.014806 | 0.095801 | 0.036420 | 0.281833 |
Feature Importance Summary
To reduce variance between training runs, I ran 25 unique train/test splits and generated feature importance data for each. The results show that neighborhood and architecture style are well above the other feature importance values. Quality and garage type are also reasonably important.
justify model
```python
# create dataframe for showing pie chart
features_means = features.mean()
features_pie = pd.DataFrame(zip(list(x_encoded.columns), list(features_means)))
features_pie.columns = ["feature", 'percentage']

# show pie chart
feature_chart = px.pie(features_pie, values='percentage', names='feature')
feature_chart.show()
```
Task 4| Model Quality
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
Accuracy Scoring Data for Model Selection Columns
statistical summary for selected columns
```python
# create a dataframe from the results for both the 6 column and selected columns
results_df = pd.DataFrame(results_accuracy_6columns)
results_df.columns = ['score']
results_df_selected = pd.DataFrame(results_accuracy_selected_by_model)
results_df_selected.columns = ['score']

# reshape the datapoints for a grid display
df_grid = pd.DataFrame(results_df.to_numpy().reshape(row_count, row_count))
df_grid_selected = pd.DataFrame(results_df_selected.to_numpy().reshape(row_count, row_count))

# set color
cm = sns.light_palette("blue", as_cmap=True)

# show table
df_grid_selected.style \
    .hide(axis='columns') \
    .format(precision=3) \
    .background_gradient(cmap=cm) \
    .set_table_styles([{'selector': 'caption',
                        'props': [('color', 'blue'), ('font-size', '25px')]}])
```
```python
# describe the statistical data, and transpose for display
described_data = results_df.describe().transpose()[['count', 'mean', 'std', 'min', 'max']]
described_data = described_data.rename(columns={'std': 'standard deviation'})
described_selected_data = results_df_selected.describe().transpose()[['count', 'mean', 'std', 'min', 'max']]
described_selected_data = described_selected_data.rename(columns={'std': 'standard deviation'})

# create statistical data for use in narrative
mean = round(float(described_data['mean'].to_string().split()[1]), 3)
standard_deviation = round(float(described_data['standard deviation'].to_string().split()[1]), 3)
min_value = round(float(described_data['min'].to_string().split()[1]), 3)
max_value = round(float(described_data['max'].to_string().split()[1]), 3)
mean_selected = round(float(described_selected_data['mean'].to_string().split()[1]), 3)

# show chart
described_data.style.format({"count": "{:,.0f}",
                             "mean": "{:.3f}",
                             "standard deviation": "{:.3f}",
                             "min": "{:.3f}",
                             "max": "{:.3f}"}) \
    .set_table_styles([{'selector': 'caption',
                        'props': [('color', 'blue'), ('font-size', '25px')]}])
```
|       | count | mean  | standard deviation | min   | max   |
|-------|-------|-------|--------------------|-------|-------|
| score | 25    | 0.948 | 0.003              | 0.943 | 0.953 |
I ran the train_test_split, fit, and score methods against each column set 25 times to build a statistically meaningful sample. The first data set uses just the two columns chosen by the SelectFromModel selector; the second uses the 6 columns I had originally selected.
The 2 column data set produced a mean accuracy of 0.913. Comparing the results, using the 6 column data set instead raises the mean accuracy to 0.948.
In addition, the standard deviation across our samples was tiny at 0.003, with a minimum of 0.943 and a maximum of 0.953.
The resulting model successfully identifies pre-1980 homes with a mean accuracy of 0.948. Taking two standard deviations (2 × 0.003) on either side of the mean gives an approximate 95% interval of (0.942, 0.954).
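The interval arithmetic, using the summary statistics reported in the accuracy table above:

```python
# summary statistics from the accuracy table above
mean = 0.948
std = 0.003

# approximate 95% interval: mean +/- 2 standard deviations of the per-run scores
lower = round(mean - 2 * std, 3)
upper = round(mean + 2 * std, 3)
print((lower, upper))  # (0.942, 0.954)
```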
Accuracy is calculated as follows:
\[\begin{align}
\text{Accuracy} &= \frac{R_c}{T_t}\\
\text{where}\\
R_c &= \text{Correct Responses}\\
T_t &= \text{Total Test Cases}
\end{align}\]
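A quick worked example of the formula with toy labels (not drawn from the housing data):

```python
# hypothetical true labels and model predictions
y_true = [True, True, False, False, True]
y_pred = [True, False, False, False, True]

# R_c: correct responses; T_t: total test cases
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 4 of 5 correct -> 0.8
```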
Precision Scoring Data for 6 Columns
precision statistical summary for 6 columns
```python
# create a dataframe from the precision results for the 6 columns
results_df = pd.DataFrame(results_precision_6columns)
results_df.columns = ['score']

# reshape the datapoints for a grid display
df_grid = pd.DataFrame(results_df.to_numpy().reshape(row_count, row_count))

# show table
df_grid.style \
    .hide(axis='columns') \
    .format(precision=3) \
    .background_gradient(cmap=cm) \
    .set_table_styles([{'selector': 'caption',
                        'props': [('color', 'blue'), ('font-size', '25px')]}])
```
|   |       |       |       |       |       |
|---|-------|-------|-------|-------|-------|
| 0 | 0.970 | 0.967 | 0.961 | 0.962 | 0.958 |
| 1 | 0.965 | 0.968 | 0.965 | 0.963 | 0.969 |
| 2 | 0.963 | 0.967 | 0.964 | 0.970 | 0.963 |
| 3 | 0.961 | 0.967 | 0.965 | 0.967 | 0.959 |
| 4 | 0.967 | 0.964 | 0.966 | 0.960 | 0.957 |
Precision Summary Analysis
precision statistical summary 2
```python
# describe the statistical data, and transpose for display
described_data = results_df.describe().transpose()[['count', 'mean', 'std', 'min', 'max']]
described_data = described_data.rename(columns={'std': 'standard deviation'})
described_selected_data = results_df_selected.describe().transpose()[['count', 'mean', 'std', 'min', 'max']]
described_selected_data = described_selected_data.rename(columns={'std': 'standard deviation'})

# create statistical data for use in narrative
mean = round(float(described_data['mean'].to_string().split()[1]), 3)
standard_deviation = round(float(described_data['standard deviation'].to_string().split()[1]), 3)
min_value = round(float(described_data['min'].to_string().split()[1]), 3)
max_value = round(float(described_data['max'].to_string().split()[1]), 3)
mean_selected = round(float(described_selected_data['mean'].to_string().split()[1]), 3)

# show chart
described_data.style.format({"count": "{:,.0f}",
                             "mean": "{:.3f}",
                             "standard deviation": "{:.3f}",
                             "min": "{:.3f}",
                             "max": "{:.3f}"}) \
    .set_table_styles([{'selector': 'caption',
                        'props': [('color', 'blue'), ('font-size', '25px')]}])
```
|       | count | mean  | standard deviation | min   | max   |
|-------|-------|-------|--------------------|-------|-------|
| score | 25    | 0.964 | 0.004              | 0.957 | 0.970 |
The precision results for the 6 column data set came back at 0.964. Precision is a useful indicator that the model is not producing a significant number of false positives. Our precision here is excellent, even better than our accuracy.
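For reference, precision is defined as follows, written in the same style as the accuracy formula above:

\[\begin{align}
\text{Precision} &= \frac{T_p}{T_p + F_p}\\
\text{where}\\
T_p &= \text{True Positives}\\
F_p &= \text{False Positives}
\end{align}\]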